Regular expressions are another, powerful way of handling strings in Python. You can use them for all sorts of very common string operations, like searching words and replacing them in texts. In fact, regular expressions are often used across many programming languages and text editors. Because of that, you will often be able to reuse many of the things that you will learn about below. The functionality for using regular expressions in Python is included in the 're' package, which you should be able to import as usual:
In [2]:
import re
On many occasions, you will want to search a string in your scripts: e.g. does the following word appear in a text? Is the format of the following email address valid and does it contain an @-symbol and a least one dot? To carry out such operations, the first thing you need is a string to search:
In [2]:
s = "In principio erat verbum, et verbum erat apud Deum."
The next thing we define is the actual regular expression which we will use, or the string that we will use to search the sentence we defined above. We pass this string to the compile()
function in the re
package, which will allow fast searching later on. Note that we put an r
in front of this string when we initialize it, which turns our string into a so-called 'raw string'. While this is not always necessary, it is a good idea to do this consistently when dealing with regular expressions.
In [3]:
pattern = re.compile(r"verbum")
Next, we can call the sub()
function from the re
package on this pattern, in order to replace (or 'substitute') our pattern with another word, like this:
In [6]:
text = pattern.sub("XXX", s)
print(text)
In [7]:
help(re.sub)
In [ ]:
text = re.sub("ABC", )
Note the order of the arguments passed to sub()
: first, the word we would like to replace our pattern with, and secondly our original string. We can just as easily get back our original string:
In [ ]:
pattern2 = re.compile(r"XXX")
text = pattern2.sub("verbum", s)
print(text)
So far nothing special: we are simply replacing one word for another word. The smart ones among you will have noticed that we could have achieved the exact same result using the replace()
function, which we came across in an earlier chapter. But now: say you would like to replace all vowels in a string. With regular expressions, this is a piece of cake:
In [ ]:
vowel_pattern = re.compile(r"a|e|o|u|i")
without_vowels = vowel_pattern.sub("X", s)
print(without_vowels)
Note how our pattern allows for a special syntax: the pipe symbol which we used allows to express that one character OR another one is fine for the regular expression to match. Oops: the capital letter at the beginning of the sentence hasn't been replaced because we only included lowercase vowels in our pattern definition. Let's add the uppercase vowels to the regex:
In [ ]:
vowel_pattern = re.compile(r"a|A|e|E|o|O|u|U|i|I")
without_vowels = vowel_pattern.sub("X", s)
print(without_vowels)
There is in fact an easy way to match all lowercase and uppercase characters in a string, like this:
In [ ]:
ups = re.compile(r"[A-Z]")
lows = re.compile(r"[a-z]")
without_ups = ups.sub("X", s)
print(without_ups)
without_ups = lows.sub("X", s)
print(without_ups)
These specific patterns are called 'ranges': they will match any lowercase or uppercase letter. In fact, you can use such a range syntax using squared brackets, to replace the pipe syntax we used earlier.
In [9]:
vowel_pattern = re.compile(r"[aeoui]")
without_vowels = vowel_pattern.sub("M", s)
print(without_vowels)
In [ ]:
You can also look for more specific, as well as longer letter groups by arranging them with round brackets:
In [10]:
p = re.compile(r"(ri)|(um)|(Th)")
print(vowel_pattern.sub("A", s))
There is also a syntax to match any character (except the newline):
In [ ]:
any_char = re.compile(r".")
print(any_char.sub("X", s))
If you would like your expression to match an actual dot, you have to escape it using a backslash:
In [ ]:
dot = re.compile(r"\.")
print(dot.sub("X", s))
By the way, there exist more characters that you might have to escape using a backslash. This is because they are part of the syntax that use to define regular expressions: if you don't escape them, Python will not know that you are looking for an literal match. Characters that you typically might want to escape include: ( + ? . * ^ $ ( ) [ ] { } | \ ) ,. For example:
In [ ]:
s = "In principio [erat] verbum, et verbum erat apud Deum."
brackets_wrong = re.compile(r"[|]")
print(brackets_wrong.sub("X", s))
brackets_right = re.compile(r"(\[)|(\])")
print(brackets_right.sub("X", s))
The syntax for regular expression includes a whole range of possibilities which we simply cannot all deal with it here. Because of that we will stick to a number of helpful examples. An interesting feature is that you can specify whether or not a character really has to occur. You can check whether the pattern occurs in a string using the match()
function which will return None
if it doesn't find the pattern in the string searched:
In [ ]:
pattern = re.compile(r"m{2,4}")
print(pattern.match(""))
print(pattern.match("m"))
print(pattern.match("mm"))
print(pattern.match("mmm"))
print(pattern.match("mmmm"))
print(pattern.match("mmmmm"))
print(pattern.match("mmmmmm"))
print(pattern.match("mmmmammm"))
With the curly brackets, you indicate that you are only interested in the letter 'm' if it occurs 2,3 or 4 times in a row in the string you search. Because None
is returned if not a single match was found, you can use the outcome of match()
in an if-statement. The following example shows how you can also use the curly brackets to match an exact number of occurences (in this case three a's).
In [ ]:
pattern = re.compile(r"a{5}")
if pattern.match("aaaaa"):
print("Found it!")
else:
print("Nope...")
# or:
if pattern.match("aa"):
print("Found it!")
else:
print("Nope...")
Using a plus sign you can indicate whether you want to match multiple occurrences of a character. A good example from the world of paper writing are double spaces, which can be hard to spot. In the code block below, we replace all multiple occurences of a whitespace character by a single whitespace character. Note that you can use the whitespace character just like any other character (you don't have to escape it). Multiple occurences of the whitespace character will be matched: it doesn't matter how many, as long as there is at least one:
In [ ]:
paper = "My thesis on biology contains a lot of double spaces. I will remove them."
mult = re.compile(r" +")
print(mult.sub(" ", paper))
A similar piece of functionality is offered by the asterisk operator: here you can match multiple occurences of the same character in a row OR not a single one. Note the subtle difference with respect to the plus operator, which needs at least a single occurence of the character to match. Here we use the search()
function which will search the entire string: the match()
function which we used earlier will only look for matches at the very beginning of a string. Keep this in mind! The final pattern below yields a match, although there is not a single 'x' in the sentence. That is because the pattern with the asterisk says: "a single x, or no x at all".
In [18]:
s = "In English some letters occur multiple times in a row."
p1 = re.compile(r"t")
p2 = re.compile(r"t+")
p3 = re.compile(r"x")
p4 = re.compile(r"x*")
print(p1.search(s))
print(p2.search(s))
print(p3.search(s))
print(p4.search(s))
In [20]:
print(p2.search(s))
Interestingly, you also use regular expression to search inside words. Can you explain why the following patterns (don't) match?
In [ ]:
candidates = ["good", "god", "gud", "gd"]
p = re.compile(r"go+d")
for c in candidates:
print(p.match(c))
Speaking of words: it might be interesting to know that you can use regular expressions for advanced string splitting. If you want to split a sentence across all whitespace characters for instance, you can use an espaced \s
. This operator will match all whitespace characters, such as tabs, linebreaks, normal spaces etc. If you add a +
sign, your pattern will match series of whitespace characters:
In [ ]:
s = """This is a text on three lines
with multiple instances of
double spaces."""
whitespace = re.compile(r"\s+")
print(whitespace.split(s))
If you would have wanted to split on the linebreaks only (possible followed by e.g. spaces), you could have used the following pattern:
In [21]:
s = """This is a text on three lines
with multiple instances of
double spaces."""
whitespace = re.compile(r"\s*\n\s*")
print(whitespace.split(s))
If we want to correct the double spaces, we could now do:
In [ ]:
ds = re.compile(r" +")
for line in whitespace.split(s):
print ds.sub(" ", line)
One final feature we should mention is the [^...]
syntax: this will match any character that is NOT between the brackets. Remember the vowel_pattern above? Using the caret symbol we can quickly 'invert' this pattern, so that it will match all consonants:
In [ ]:
s = "these are vowels and consonants"
consonants = re.compile(r"[^aeuoi]")
print(consonants.sub("X", s))
Regular expressions are really useful, but they can get tricky as well as difficult to read, because of the many different options that exist. There is a whole range of special symbols which you can use to match nearly everything in a text, from word boundaries (\b) to digits (\d) etc. Don't learn these by heart but look up a good reference list online (like http://www.tutorialspoint.com/python/python_reg_expressions.htm). As usual Stackoverflow will prove really useful when you search for information online.
Example data:
color = red
number =7
name= joe
age = 9 ...
Example data:
Mike, 28, scientist, Belgium
Lars, 49, research director, Luxemburg
Matt, 52, rockstar, US
Example output: [["Mike","28","scientist","Belgium"],["Lars","49","research director","Luxemburg"], ...]
Example data: name, age, profession, country
Mike, 28, scientist, Belgium
Lars, 49, research director, Luxemburg
Matt, 52, rockstar, US ... Example output: [{"name": "Mike", "age": "28", "profession":"scientist", "country":"Belgium"}, {"name": "Lars", "age": "49", "profession":"research director", "country":"Luxemburg"]}, ...]